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Abstract 

Linear models have found widespread use in statistical investigations. For every linear 
model there exists a matrix representation for which the ReML (Restricted Maximum Like- 
lihood) can be constructed from the elements of the corresponding matrix. This method 
works in the standard manner when the covariance structure is non-singular. It can also be 
used in the case where the covariance structure is singular, because the method identifies 
particular non-stochastic linear combinations of the observations which must be constrained 
to zero. In order to use this method, the Cholesky decomposition has to be generalized 
to symmetric and indefinite matrices using complex arithmetic methods. This method is 
applied to the problem of determining the spatial size (vertex) for the Higgs Boson decay 
in the Higgs — > 4 lepton channel. A comparison based on the x 2 variable from the vertex 
fit for Higgs signal and tt background is presented and shows that the background can be 
greatly suppressed using the x 2 variable. One of the maior ad vantages of this method over 
the currently adopted technique of b-tagging ( Tomalin . 20081 ) is that it is not affected by 
multiple interactions (pile up). 

Keywords: Higgs Boson, Vertexing, b-tagging, ReML, Cholesky Algorithm, pile up, Singular Error 
Matrices 
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1 Introduction 



(ISmithl . Il995l ) describes how to efficiently compute both forward and backward derivatives of th e 
Cholesky decomposition by using methods taken from automatic differentiation (jGriewankl . 120081 ) . 
This technique permits variance-covariance estimation by restricted maximum likelihood (ReML). 
While the Cholesky decomposition and its derivat ives are finding applicat ions with ReML and in 
linear models typical to studies of animal breeding (IMeyer and Smith! Il995l ). these are coming with 
additional innovations (Meyer, 2001; Meyer and Kirkpatrick, 2005). The computational methods 
have been introduced in general statistical packages, including SAS (SAS Institute, 2009; Meyer, 
2007). 

Outside of ReML, the Cholesky decompo sition and its deri vatives are finding the following 



applications: in spatial modeling or kriging ( iToal. et all 120091 ); in latt i ce mo dels with an ex- 
ample showing non-parametric curve fitt ing and cross - valida tion ( Smith! . 1997 ); in optimization 



involving the inverse diffusion problem (jChristiansonl . 119971 ); to differentiate a Laplace approx- 
imation of_j^Jikehh^dJunction, thereby permitting estimations of parameters in a population 
model (jFrimannslundl . 120061 ) ; to permit Hamilto nian Mar k ov Ch ain Monte Carlo in the context of 



non-linear regression by function factorization (ISchmidtl. 120091); in calcu 
associated with exposure risk of financial portfolios (ICapriotti and Giles! . 



ating price sensitivities 



20101 ); a nd for optimiz - 



i ng several hyper-para meters within a gradient-based machine learning algorithm ( iBengid . l2000l ). 



(IDeHoog. et al 



20 111 ) have introduced a general notation for treating that task involved with 



calculating derivatives of functions that depend on triangular matrix factors (including Cholesky's 
factor), and they describe several application areas too. 

The above applications consider only the Cholesky decomposition of a positive-definite (or 
semi-definite) matrix. This convention is not followed in (Smith, 2001a and 2001b) where consid- 
eration is given to a linear state-space model, nor is it followed in the present paper. Smith used 
the Cholesky decomposition (and it derivatives) to estimate second moments, but in this different 
application the Cholesky decomposition was applied to a symmetric and indefinite matrix. In 
general, the Cholesky decomposition may not be possible for an indefinite matrix, in which case 
an attempted decomposition may lead to an interruption. Such interruption is possible when a 
linear model includes effects that come with a singular variance-covariance matrix structure as is 
possible with state-space models. It has been the case that an interrupted Cholesky decomposition 
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permits the correct ReML likelihood calculation following (Smith, 2001a and 2001b). However, 
Smith's approach implies that the linear model is consistent with the singular variance structure, 
and in general this assumption cannot be made. The purpose of the present paper is to get be- 
yond this limitation by introducing linear constraints that guarantee consistency, and hence this 
enlarges the set of cases where Cholesky-based R eML can be ap plied. There are also some minor 



typographical errors in the pseudocode listed in (ISmith 



2001al ). and the corrected pseudocode is 



presented in the Appendix (Section [TTj) of this paper. Furthermore, it is the additional purpose of 
this paper to get beyond the common ReML applications and demonstrate a new Cholesky-based 
discrimination method that is suitable for enriching data samples that are used in the search for 
the Higgs Boson ("Higgs Hunting") at the Large Hadron Collider (LHC) 

The mixed linear model and ReML are reviewed in Section [2j In Section [31 Likelihood eval- 
uation is cast in terms of residual error, and the Cholesky decomposition of a symmetric and 
indefinite matrix. Section H] provides mathematical justifications for the constraints that may be 
needed when the Cholesky decomposition is interrupted. To treat the possibility of constraints, 
Section [5] describes optimization by Lagrange multipliers and Section [6] describes optimization by 
penalized maximum likelihood. Both methods are readily adapted to the Cholesky decomposition 
and its derivatives. Section [7] treats estimation of fixed effects as an auxiliary calculation to the 
Cholesky decomposition by solving an indefinite system of equations. Section [8] presents an exam- 
ple coming from experimental physics, where the ReML method is used to devise a discrimination 
function to reduce backgrounds and preserve signal events in order to enrich data samples used 
for Higgs Hunting. The conclusion follows in Section [9j 



2 Mixed Linear Models, Log-Likelihood and ReML 

Notation: In the following Sections, we denote the transpose of a matrix R by R'. We also denote 
column vector in component form by square brackets, v = [a, b, c, ..], and the corresponding row 
vectors with parentheses, v' = (a, b, c, ...). 

Linear models, including mixed linear models, have found widespread use in statistical investi- 
gations. The linear model, though additive, is frequently flexible enough for real situations as an 
approximation around the mean. Also, linear models and the associated normality assumptions 
are well understood. Methods as old as the analysis of variance (ANOVA) are completely consis- 
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tent with mixed model methods. Furthermore, while the theory is developing in new areas, such 
the Gibbs Sampler or with other Bayesian methods, mixed model methods benefit as new tools 
come along. The mixed linear model is represented by: 



y = X/3 + Zu + e, 



where y is a vector of observations, (3 is a vector of fixed effects, u is vector of random effects 
and e is the observational error. The matrices X and Z are incidence matrices that relate the 
various effects to observations. The first moments for the random effects (their expectations) are 
E[u] =0 and E[e] = 0, and the variance-covariance structure is given by var[u] = G, var[e] = R 
and cov[u, e] = 0. Additional assumptions are needed to implement maximum likelihood or 
computer simulati on, and generally y, u, and e are taken as multivariate normal. As indicated in 
(iGoldbergerl . Il962l ). the Best Linear Unbiased Prediction (BLUP) of u is found by evaluating 



u = GZ / V- 1 [y-X/3], 



where 



V = var(y) = ZGZ' + R (1) 

and where /3 is the Best Linear Unbiased Estimate (BLUE) of the fixed effects obtained by the 
Generalized Least Squares (GSE) problem 



(X'V^X) 0} = [X'v-V]- 



(2) 



These equations c an be reformulated so th at the solutions can be obtained directly from the mixed 



model equations ([Henderson, et al. 



1959 



X'R- X X XR *Z 
Z'R" 1 X Z'R^Z + G 1 





(3) 



Howe ver, Section [7] (see below) provides and alternative method based on the method of (jSiegell . 

It is well known that if a non-informative prior is used to describe the fixed effects in a 
Bayesian context, the posterior distribution (conditional on y) of a linear combination of b (where 
b = \J3', u']), say Hb, is multivariate normal. In this case the posterior distribution has mean 
vector Hb (where b = \fl', u']) and variance-covariance matrix given by HC _1 H', where C is the 



2-by-2 partitioned matrix on the Left Hand Side (LHS) of Eq. ((3]). Therefore the mixed model 
equations are only good when the inverses R _1 and G -1 both exist. If either Rr 1 or G -1 does 
not exist, then C does not exist. Therefore, in the case where the covariance structure is singular, 
the mixed model equations will not apply. 

The log-likelihood for the Multivariate Normal (MN) is given by 

ln(MN) = constant - - In |V| - -(y - X/^'V^y - X/3) 

2 2 

The maximum likelihood estimates of /3 and the dispersion parameters (R and G) are found by 
maximizing the log-likelihood. Estimates of the dispersion parameters can be badly biased by 
small-sample errors induced by the estimation of ft. This is a serious problem when the dimension 
of /3 is large relative to the inf ormation available to estimate . 

To overcome this problem (IPatterson and Thompson! . Il97ll ) introduced Restricted Maximum 
Likelihood (ReML), where the dispersion parameters are found by maximizing 



ln(ReML) = constant - - In |V| 



In |X'VX| - -(y - X^'V-^y - X/3), 



(4) 



where (3 is the solution obtained by GSE, Eq. ([2]). ReML has the advantage of estimating away the 
/3 parameters from the Likelihood. This is especially useful in cases where one wants to concentrate 
on minimizing the deviations from a common mean, without explicitly finding that common mean. 
(jWikipedial . 1201 ll ) states, "In contrast to conventional maximum likeli hood estimation , ReML can 
produce unbiased estimates of variance and covariance parameters." (jHarvillel . Il974f ) derived the 
likelihood in Eq. fll]) to treat the "error contrasts" which are found by taking a complete set of 
linear combinations of the observations which are sufficient to remove the effect of /3 while leaving 
the maxi mal amount of information for the purpose of ReML. An early review of ReML can be 
found in (jHarvilld . 119771 ) . More reviews can be found in (Speed, 1977 and 1995). The relevance for 
the particular application described in Section [8] (see below), is that the \ 2 that is determined by 
ReML is independent of the central vertex coordinates. It is possible to back-out the coordinates 
of the fitted vertex, but it is not necessary to know the coordinates in order to determine the 
goodness of fit. 
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3 Applying ReML to the Z = Scenario 



Consider the following linear model £ in the case where Z = 

£ : y = XP + e, 

where y is an independent observation vector, /3 is a vector of fixed effects which are to be removed 
using the ReML likelihood, and e is a vector of random residuals. The array X is the incidence 
matrix that assigns fixed effects to observations. The variance-covariance matrix of the random 
residuals is denoted by R 

var[e] = R. 

The linear model £ is now fully specified and the error matrix satisfies V = R so that the likelihood 
given by Eq. fflh reduces to 



ln(ReML) — > constant — — In |R| 



In |X'RX| - -(y - X^)'R _1 (y - X/3). 



(5) 



(ISmithl . l2001bl ) associates a symmetric matrix K with the above linear model £ as follows: 



R X y ^ 



K 



X' 

y' o o 



(6) 



/ 



where the O's denote the appropriate-sized null square matrices required to fill out the rows and 
columns of K and X' is the transpose of X. Therefore, £ implies the existence of a matrix K, 
and K implies there exists a model £. Our main interest in Eq. ([6]) is that R is not required 
to be invertible. The method we use is able to identify those linear combination of y which are 
associated with the zero-eigenvalues of R. These linear combinations are non-stochastic and can 
be eliminated from the stochastic part of the likelihood and treated by the method of Lagrange 
constraints. In other words, our method is able to find the natural constraints required for the 
maximum likelihood problem for ReML. 

It is well known that the Cholesky d ecomposition r uns to completion with any matrix that 



is symmetric and non-negative definite. ((Smith 



2001bl ) shows that the Cholesky decomposition 



can also be performed on the matrix K, and this is most curious because while K is symmetric, 
it is not non-negative definite. K is classified as indefinite. However, the rows and columns of 
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K must be first permuted, leaving the last row and column in place as required. This is enough 
to permit computation of the Cholesky decomposition, or the lower triangular matrix L, where 
Kkxk = LL' (for some permutation involving the first k — 1 rows and columns). (Smith, 2000 
and 2001a) describes the ReML function, or the likelihood that removes the impacts of /3, to be 
a function of this particular L. This calculation is given neatly as follows: 



where the absolute modulus function \L iyi \ transforms a possible imaginary number into a positive 
number, and the summation is only over the non-zero pivots. When zero pivots are encountered 
(when Lij = for some i) we require that the i-th column of L vanishes and this is enough to 
justify likelihood evaluation by the above formula. It is easy to show that this calculation agrees 
with the likelihood given in Eq. (jSJ) when R is non-singular. However, in the most general case 
we cannot simply skip the zero pivots, and more will be said about this in the Section HI because 
particular constraints must be imposed. 

The likelihood function is derived from elements of the Cholesky decomposition, and so there is 
nothing else that is needed to perform ReML but to find the derivatives that permit the likelihood 
to be optimized by the iterative Newton-Raphson technique. These derivatives come automatically 
with the Cholesky decomposition (Smith, 2000 and 2001a), and so there is little beyond the matrix 
K, and its decomposition, that must be considered to describe ReML. The corrected pseudocode 
for the differentiation of the Cholesky algorithm and directions for how to use it are found in the 
Appendix (see Section [TT]) . 

4 Justification and Additional Constraints 

The Cholesky decomposition of a matrix Kkxk proceeds by identifying a non-zero diagonal element 
(the first pivot), and then permuting rows and columns of K to reposition that diagonal to the first 
diagonal position. Elementary row operations are now conducted to annihilate elements below 
the diagonal in the first column (this is pivoting). The first row of K is now transformed into the 
first column of L by replacing the first diagonal by its square root and by dividing the remaining 
elements in the first row by that square root. The first row and column of K are now deleted to 
produce a smaller sub-matrix of order k — 1. This sub-matrix is called the Schur complement. An 




i<k 
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outline computation of the first Schur complement (of K\\ in K) is displayed below. 

/ K n K12 ... K lk \ ( K 22 K 23 ■■■ K 2k 

K 32 K 33 ... K 3k 



K 21 K 



22 



K 



2k 



\ Kik K k2 ... K kk J 



Kn 



31 



(K 12 ,Ki 3 , ...,Ki k ) 



V K kl ) 



\ K k2 K k3 ... K kk ) 

When K is symmetric, the Schur complement is symmetric. Therefore, the Cholesky decompo- 
sition may proceed by half-storing K. The above steps are now repeated to generate a second 
Schur complement, then a third and so on. The algorithm may be organized to overwrite the 
half-stored K with L. The operations of the Cholesky decomposition are reversible, and therefore, 
the Cholesky deco mposition conserves information. 

( jSmithl . |2001b|) generalized the Cholesky decomposition for the case when K is symmetric and 
indefinite by using a complex representation for the Cholesky diagonals if needed. The sub-matrix 
initially containing the zero entries is the non-positive definite partition, and with fill-in (generated 
from pivoting from the diagonals of R) the non-positive definite partition becomes more negative. 
When pivoting switches over to the diagonals of the non-positive definite partition, then fill-in 
results in the partition initially set to R (or the non-negative definite partition) thereby making it 
more positive. If a few pivots are selected from the diagonals of the non-negative definite partition 
first, before switching over to the non-positive partition and continuing pivoting over the negative 
diagonals until the non-positive definit e partition ret urns to a matrix containing zero in all its 



200 lbl ) referred to the particular pivot order as a 



entries (expect the last diagonal), then ( jSmithl . 
standard data reduction. 

After the first pivot step within the Cholesky decomposition, the lead row and column is 
removed from K, and this reduces the dimension of the resulting Schur complement by 1 as noted 
above. After a standard data reduction, the Schur complement is again reduced in dimension from 
K but retains the special matrix form (ignoring the last diagonal with no loss in generality): 



( Rx Xx yi ^ 



X'x 













/ 



where the subscript indicates that K x was generated from K following the Cholesky decomposition. 
We will denote K as Kq, to maintain consistency. In shorthand, this transformation is denoted 
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by K n ->K n. 

flSmithl . l2001bl ) notes that K x also represents a model C±, given by 



C 1 :y 1 = Xi/3i + ei, 



where /?i is sub- vector of /3 and the variance-covariance matrix of ei is Ri. The model L\ is 
fully determined given the Schur complement Ki. Moreover, those observations that have been 
processed and eliminated from K are uncorrelated with e\. Also, the extracted information was 
accounted for in the previously constructed Cholesky diagonal elements. In other words, the 
standard data reduction separates the data into two statistically independent parts. The first 
statistically independent part is used to compute that block of the log-likelihood already known 
for ReML and consistent with Eq. §5§. The remaining part pertains to the treatment of C\, but 
because Ki is in the form of K, the process can be iterated and the Cholesky decomposition 
continued to the next pivot. 

The Cholesky decomposition signifies a tower of standard data reductions: 



At each transition the dimensions of the Schur complement get smaller and smaller, and all along 
the way information is being processed correctly to evaluate the ReML likelihood. The only 
question is whether the Cholesky decomposition completes and leads to a final Schur complement 
that folds into the likelihood function calculation. However, the last Schur complement may be 
one of two special forms which cannot be reduced further. These are the First Special Form: 



K ->■ Kx -> K 2 -> K 3 -> K 4 



And these Schur complements correspond to a tower of models: 



C ->■ £1 -> C 2 -> £ 3 -> A 



/ 







H 



H' 











V 







or the Second Special Form: 
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In the case of an incomplete Cholesky decomposition, what is made manifest are the constraints 
that must also be imposed on the maximization of the likelihood function. The remarkable con- 
clusion is that all the information is contained within the Cholesky decomposition (as we will see), 
even when the Cholesky decomposition is unable to finish. 

One might question the appropriateness of the standard data reduction, given that the pivot 
order in the Cholesk y decompositio n may be dynamic and may not follow the standard data 



reduction. However, ( iSmithl . l2001bl ) proved that pivot orders come in equivalence classes. Any 
dynamic order corresponds to a tower of standard data reductions where the ReML likelihood is 
treated correctly. And moreover, Schur complements are invariant to the pivot order that generates 
them (including the last diagonal of the Cholesky decomposition), as well as the determinant 
calculated as the product of pivots. The correct likelihood is calculated even when the standard 
data reduction is not followed, because implicit in any pivot order is a tower of standard data 
reductions. 

With dynamical pivoting, the Cholesky decomposition is permitted to go as far as it can while 
skipping zero diagonals that would otherwise be pivots. Some zero pivots may encounter fill-in 
during the computation, and this permits the Cholesky decomposition to continue with additional 
rounds of pivoting. If for some reason the Cholesky decomposition in unable to complete the 
operations, the matrix that remains (as unfinished) will be a sub-matrix of what had been K and 
what was becoming L (but never completed). The unfinished sub-matrix contains the last Schur 
complement either in the First or Second Special Forms (noted above), even if the pivot order did 
not follow a tower of standard data reductions. 

If extraneous parameters remain impacting the likelihood function, they are found involved in 
non-stochastic linear combinations, and these combinations are revealed in the Schur complement 
(the First Special Form containing both H and v) that could not be reduced by the Cholesky 
decomposition: 

H/3 : = v, 

where 0i is a sub-vector of /3, and v is a revealed linear combination of y. If the extraneous 
parameters are no longer present, then the matrix H goes away. We are left with the Second 
Special Form of the Schur complement that only involves v equated to a column vector of zeros: 

v = 0. 
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When the Cholesky decomposition is unable to finish, then one of these systems of linear equations 
becomes a side condition (a constraint) for the likelihood maximization exercise. 

We may be uninterested in the extraneous parameters. However, we are interested in consistent 
models that don't contradict themselves when linear combinations are made of their components. 
This concern is quite independent of the extraneous parameters as we will see. The set of non- 
stochastic equations that are revealed (by the First Special Form of the Schur complement) will 
involve extraneous parameters, and these will be ignored in as much as ReML removes their 
impacts from the likelihood. The revealed set of equations can imply whatever they want about 
/3i and ReML will ignore them. This assumes that the statistical model is already consistent. 
However, the non-extraneous parameters that are identified for likeli hood evaluatio n need not 
conform to a consistent linear model and that's where we depart from (jSmithl . l2001bl ). 

Likewise, if the attempt at Cholesky decomposition ends with a Schur Complement of the 
Second Special Form (leading to the equation v = 0), what is revealed are constraints that must 
be imposed to guarantee a consistent linear model. In this case there are no extraneous parameters 
to confuse the issue. 



5 Constraints: Lagrange Multiplier Method 

In the event that the Cholesky decomposition encounters a zero and is unable to complete, in 
order to to maintain consistency one can consider appending to the ReML likelihood a Lagrange 
multiplier expression according to one for the following cases: 

Case 1 : iq(L, A) = constant - y^ln \L iA \ + -L\ k + A'(H/?i - v), 

%<k 

where maximization proceeds with respect to the non-extraneous parameters (si, S2, S3, S4), f3\ 
and A. Therefore, the side condition enforces the restriction that v is in the column space of H 
independent of fa. 

Case 2 : F 2 (L, A) = constant — In \ L^i\ + -L\ k + A'v, 

i<k 

where maximization proceeds with respect to the non-extraneous parameters (s\, S2, S3, S4) and A. 
Therefore, the side conditions enforces the restriction that v = 0. 
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In both cases, we find an objective function so constructed from the known element s of the 
unfinished Cholesky decomposition. First and second derivatives are available (jSmithl . l2001a| ) 
and constrained optimization is straightforward by the technique of Newton-Raphson iteration as 
applied in Section [8] below. 

In paradoxical cases, the data may produce an inconsistent model. In which case, we recom- 
mend constrained maximization as noted above, and in this way useful information is included 
that would otherwise be ignored. 

In case the Cholesky decomposition is unable to finish and the Schur complement is of the 
Second Special Form, then the particular linear combination of elements corresponding to v can 
be constrained to zero by adding a Lagrange multiplier term Av as indicated above. An alternative 
method involves adding a quadratic penalty term to the likelihood (see Section [6]) can be used to 
adjust the constraints to zero within estimated precision. 

Case 1 and Case 2 above represent possible objective functions. However, we concentrate on 
Case 2 because that is the one relevant to our example. The objective function to be maximized 
is of the form F 2 (L,A). The three algorithms of Section [11] are used to construct the Newton- 
Raphson linear system. If required, the first derivatives with respect to the Lagrange multipliers 
(the A vector) come directly out of the v elements of the corresponding Schur complement of 
the unfinished matrix L. Table 1 sketches the Cholesky decomposition which is used to calculate 
the components of the ReML likelihood and the possible non-stochastic linear combinations that 
will be subject to the constraint conditions. We initialize the array F of Table 2 to the array 
<9_F 2 (L, X)/dLij that represents the constrained objective function including the Lagrange multi- 
pliers. We then find the mixed second derivatives directly from Q (see Table 3, which represents 
the forward derivative calculations.) 



6 Constraints: Penalized Likelihood Method 

There is another method of implementing constraints besides Lagrange multipliers. Instead of 
introducing a set of Lagrange multipliers which are treated as independent parameters to be 
varied in a maximum likelihood procedure, the terms which they multiply can be squared and 
then multiplied by a fixed constant. This is known as the "Penalized Likelihood" method. In the 
Penalized Likelihood method the Lagrange equations for Case 1 and Case 2 above are modified 
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to the form 



Case 1 : Fi(L) = constant — ^ In \Li^\ + -L 2 k k + ^ Ci(H/3i — v)f, 

i<k 

where the q are nominated constants and maximization proceeds with with respect to the non- 
extraneous parameters (si, s 2 , s 3 , s 4 ) and f3\. 

Case 2 : F 2 (L) = constant — In \L iti \ + -L>l k + Ci(vi) 2 , 

i<k 

where C; are nominated constants and maximization proceeds with respect to the non-extraneous 
parameters (si, s 2 , S3, S4). The squares of the column vectors in the last term on the RHS of the 
above are to be understood as vector dot products. 

The main difference between the Lagrange Multiplier method and the Penalized Likelihood 
method is that the auxiliary constraints are imposed to the maximum machine precision by the La- 
grange Multiplier method, whereas the constraints are imposed in a less extreme manner with the 
Penalized Likelihood method. The point of the nominated constants is to adjust the enforcement 
of the constraints to within reasonable limits set by the known precision of the measurements. 
This allows the possibility, for example, of not demanding that the constraints be satisfied with 
more rigor than the measurement uncertainties can justify. 



7 Estimating the Fixed Effects ft 



( jSiegell . Il965l ) described a method to compute generalized least squares (GLS) estimates by way 
of a system of equations with a symmetric and indefinite coefficient matrix. In our notation, this 
system of equations is presented below: 





wher e fi is the G 



iS estimate of (3, and A are the Lagrange multipliers. This Section will fol- 



low (jSmithl . l2001a[ ). and show how to calculate both /3 and A as adjunct operations that depict 



backward substitution, given that the Cholesky decomposition of K is available. 
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Note that the both the coefficient matrix and right-hand side of Siegel's equations are sub- 
matrices of K: the coefficient matrix is the lead sub-matrix in K, and the right-hand side is fixed 
to the last column (or row) and remains there for all permitted row-column permutations of K. 

Now rewrite Siegel's equations in simple terms: 

Cb = r, 

where C and r signify the coefficient matrix and right-hand side, respectively; and b is a column 
vector containing A and (3. To solve these equations, we might permute the rows and columns of 
C, and compute the Cholesky decomposition: TT' = C. To permute the rows and columns of 
C, and the rows of b and r are also permuted to leave Siegel's equations intact. Now multiplying 
both sides of Siegel's equations by the same elementary row operations that transform C into T' 
gives: 

T'b = a. 

When C is non-singular, a = T _1 r. The coefficient matrix in this new system of equations is 
upper triangular. Therefore, vector b can be solved by backward substitution. 

With L computed, where K = LL/, note that T is the leading sub- matrix in L and a' is the 
last row vector of L (excluding the last diagonal). Therefore, having computed L we need only 
enter backward substitution to evaluate b. The GLS estimates (3 will be found scattered in b, 
noting that b is permuted. 

When the z-th pivot encountered in L is zero, and the z-th column vanishes, a singularity is 
present and b is not estimated uniquely. We place a restriction on b: set its z-th entry to zero. 
This modification is implemented when the Cholesky decomposition is unable to finish but ends 
with a Schur complement of the Second Special Form. The associated column of L will also be 
constrained to vanish. When the Cholesky decomposition ends with the First Special Form then 
an auxiliary estimate of Pi is available from the main optimization and this estimate is used in 
backward substitution given that what had been computed for the rest of L is lower triangular. 
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8 Example: Spatial Errors in Track Extrapolations and 
Vertexing in Higgs — > 4 lepton Searches 



One of the most important decay modes of the Higgs Boson is into 4 charged leptons (Higgs — > 
4 leptons). In a solenoidal detector there is a large magnetic field present which is represented 
as a constant vector, B, which is taken by convention to point in the z-direction. Each of the 
charged leptons will approximately trace out a right-circular (or left-circular) helix with symmetry 
axis also parallel to the ^-direction. Tracks undergo multiple scattering and energy losses as they 
traverse the detector which limit the accuracy of the helical path assumption. These effects are 
usually very small for tracks in the central solenoid region that have high transverse momentum 
Pt (momentum component perpendicular to the z-axis). The radius of a track's helical path 
depends its pt- Each helix in 3-d has the form f(s) = [x(s),y(s), z(s)] and is a function of an 
independent position parameter s that marks the location along the track path. The components 
of the position vector f(s) are given by 



z(s) = Z + scot9 

The parameters on the RHS of Eq. ([7]) are obtained from the track-fitting algorithm and have the 
following meaning: 



q = charge of particle 

k = curvature of track circle, R = 1/k = radius of curvature 

Z = z-coordinate of the point on the track helix closest to z-axis 

9 = polar angle of the tangent to track at Z 

<fi = azimuthal coordinate (tan</> = — x/y) of the track helix at Z 




(p — dxy) sin — p sin(0 — s/p) 
pcos(0 — s/p) — (p — dxy) cos</> 



(7) 



P 



qR 



D 



signed distance from beam axis to track helix at Z 



dxy = qD 



dsz = Zj sin(6 ) ) 
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where the sign of the distance for D is given as positive if the track circle (projection of helix into 
the x-y plane) contains the z-axis and negative otherwise. A typical track fitting program will pro- 
duce the 5-parameters (p, <fr, 6, dxy, dsz) along with a 5-by-5 covariance matrix which can be used 
to estimate the spatial error matrix in terms of the 3-by-3 correlation matrix for x(s),y(s), z(s), 
for example, by the method of "propagation of errors" . Once these track parameters are measured, 
then the track can be extrapolated to any position along its trajectory by using the parametric 
equation of a helix as a function of s. One of the interesting properties of these spatial extrap- 
olations is the the 3-by-3 spatial error matrix is very close to rank-2 (almost singular). Fig. (TjQ) 
illustrates the reason that the spatial error matrix has a near-zero eigenvector lying almost entirely 
in the x-y plane. If the radius of the track circle is much larger than the radius of the tracking 
detector, then only a small fraction of the circumference of the track is measured. This creates a 
measurement or observational bias in the sample of hits along the track. Such an observational 

Effects of Observational Bias 




Figure 1: Relation of Error Ellipsoid of Center of Curvature and Tracking Hits. 



bias creates a highly eccentric error matrix for the reconstructed x and y coordinates of the center- 
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of-curvature. This highly eccentric error matrix produces a spatial error matrix along the track 
helix which has a very small eigenvalue in the direction of the projection of the track tangent 
vector in the x-y plane. This property is independent of the method used to determine the track 
parameters precisely because the tracks we are most interested in have high px and, therefore, 
have very large diameter track circles compared to the diameter of the tracking chamber. 

In order to determine if a Higgs candidate has been selected in the data-analysis of the exper- 
iment, one of the criteria applied to the 4 selected tracks is: Are these 4 tracks consistent with 
originating at a common position in space? Do these 4-tracks form a "vertex" in space? If the 
hypothetical Higgs Boson decayed into 4 charged leptons, then each lepton would follow its own 
helical trajectory, but all 4 leptons would converge on a common point in space corresponding to 
the point of decay of the Higgs Boson. Reducible 4-lepton background events, would be inconsis- 
tent with originating from the same common location. Therefore constructing a likelihood for the 
hypothesis that the 4 tracks originate at the same location is a very useful algorithm to separate 
signal events (Higgs candidates) from reducible backgrounds. 

The tracks are assumed to come from a common point by hypothesis and the likelihood is 
obtained by first constructing the K matrix of Eq. above with the yj(sj) = [x(si),y(si), z(si)] 
for coordinates x(si), y(si), and z(si) for each track (i is numbered 1 through 4) with its respective 
independent variable Si,S2) s 3 or s 4- The R matrix in Eq. ([6]) is obtained by the "propagation 
of errors" method which involves Taylor expansion of the functions x(s), y(s) and z(s) about the 
mean positions. The overall matrix is obtained by appending each of the contributions from each 
track to the form the assembly y = [y 1; y 2 , y3, y4] and similarly for R. The incidence matrix 
X is similarly constructed using Xi= 3-by-3 unit matrices (for % = 1,2,3,4) and stacking four 
such matrices one-above-the-other. Given 3-by-3 unit incidence matrices for Xj = I3 (for the i-th 
track), then (3 = f c . The extraneous parameters /3 are the central position coordinates of the 
vertex which can be estimated using the method of Section [7J This relationship completes the 
definition of the Linear Model. 

Since the extrapolated spatial error matrices have the nearly rank-2 property, it is necessary to 
utilize the methods described above to construct a consistent ReML for testing the hypothesis of 
the common spatial origin of the 4 selected leptons as well as automatically constructing the non- 
stochastic linear combinations which have to be constrained to zero (a la the Second Special Form 
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mentioned above). The constraints are determined based on comparing the size of the respective 
diagonal element in the Cholesky decomposition to a tunable threshold value which is optimized 
for signal events. When such a zero is encountered, the location on the main diagonal is noted and 
stored for subsequent determination of the constraint itself. The ReML likelihood and collateral 
constraints are determined as outlined above and the system of first and second derivatives are 
calculated using the Tables 1, 2, and 3 of the appendix. A linear system of equations is then 
formed by Taylor expansion of the objective function and a step is taken by Newton-Raphson 
(NR) iteration to move towards the maximum likelihood position. The coordinate functions of 
Eq. (CO), evaluated at s — 0, produce the coordinates of the helix at closest approach to the beam- 
axis (z-axis). Since a real Higgs would be produced very close to the colliding beam axis, then 
si, S2, S3, S4 should all be initialized to at the beginning of the first NR step for fast convergence. 
Alternatively, one could perform a binary search about s = (for each track) to determine 
appropriate initial values for the NR method. 

After convergence, the algorithm produces a set of four parameter values s\, S2, S3, S4 which give 
the location along each track such that the assumptions of the model are satisfied. The x 2 value 
is obtained as the square of the last element on the main diagonal of the resulting decomposed K 
matrix (x 2 = —L\ k ). This variable is only approximately x 2 coming with degrees of freedom that 
are underestimated (by the number of positive pivots minus the number of negative pivots plus 1, 
minus 4 from estimating the position variables s±, S2, S3, S4) and we use the empirical distribution 
for x 2 m applications. ReML and x 2 are invariant with respect to the coordinates of the central 
position (the vertex), because the impact of the central position was removed from the likelihood. 
The crucial assumption of the the model is that there be a common origination point for the four 
tracks somewhere. If the x 2 from the fit is small, the model is satisfied and the hypothesis that 
the 4-tracks originated from a common point is consistent with the data. This is what we expect 
for Higgs — > 4 leptons. Reducible background events (such as tt, Zbb —> 4 leptons) where the 
4-leptons do not originate from the same point will not fit the hypothesis of the model and will 
have large x 2 values. 

In order to implemen t the above algorithm a sample of events was generated using the PYTHIA 



(jSjostrand. et all 120061 ) Even t Generation 



p rogra m. Th ese events were t 



LDT Monte Carlo Program (IRegler. et all 120071 ) and (IValentan. et al. 



ren s imulated using the 



201 ll ) to determine the 
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detector response and to obtain fits to the track parameters and their covariance matrix. The 
analysis of the fitted tracks and th eir error matr i ces w as done using the Rave/ Vertigo datahar- 



201 lh with the above ReML algorithm inserted 



vesting and vertexing environment flWaltenbergerl . 
internally in the package. 

Applying the ReML method to the case of 4 leptons from Higgs decay and also from a known 
major source of background (it — > 4 leptons), where the leptons can be either electrons or muons, 
we arrive at a comparison of signal and background shown in Fig. ([2]). Only events in which all 
four tracks had pt > 5 GeV/c were used. 
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Figure 2: Red histogram: Expected number of Higgs events at C = 7.62 femptobarn as a 
function of the ReML \fx*. Blue histogram: Expected number of it background events at the 
same luminosity. 



As can be seen Fig. (j2J), even though the Higgs signal events have a smaller cross section and 
a smaller number of expected events than it events, they are much more peaked near zero 
for the same acceptance conditions of the detector. This results in a much higher efficiency (CDF 
value) for detecting Higgs as compared to retaining background events as can be seen in Fig. fl3]). 
This means that we can preferentially select Higgs candidates and reject it background by cutting 
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Figure 3: Left: Red histogram: Cumulative Distribution Function for Higgs Boson events as a 
function of the ReML \f~)^. Blue histogram: Cumulative Distribution Function for it background 
events. Right: Cumulative distributions plotted against each other at the same yx^. The CDF 
for accepting Higgs — > 4 lepton events is shown versus CDF for accepting tt — > 4 lepton events 
based on the x 2 from 4-track vertexing. 



on the x 2 value. 

In the case of high luminosity collisions, there can be multiple events detected every time the 
detector is triggered. These multiple events are recorded by the detector in the same time window 
and are called "pile up". The ReML method, as well as other 4-track vertexing methods, have 
the advantage of being robust against the effects of high pile up since it does not depend on an 
auxiliary determination of the best vertex (the Primary Vertex) that matches with the 4-tracks 
used in the above analysis. Finding the correct Primary Vertex increases with difficul ty as the 
proto n-proton collision luminosity increases. Discrimination methods, such as b-tagging (jTomalinl . 



20081 ) . will be adversely affected as the number of pile up events increases. At the LHC design 



luminosity, approximately 200 pile up events per beam crossing are expected. The issue of finding 
the best Primary Vertex is irrelevant to the 4-track vertex method. If one of the 4-tracks does not 
match with the other 3 because it is from an un-related pile up event, then both the b-tagging 
algorithm and the 4-track vertex method will assign a large x 2 to that event and it will be rejected. 
However, if the wrong Primary Vertex is used as a reference for the b-tagging then all 4-tracks 
will register a large x 2 even though they might be matched to a common spatial position by 
the 4-track vertex algorithm. These events should not be discarded just because the wrong and 
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irrelevant Primary Vertex was used as a reference marker. If the 4-tracks have a small value of \ 2 
from the 4-track vertex method, then that alone is sufficient to decide if they should be further 
analyzed. 

9 Conclusion 

This paper presents an automated matrix procedure for constructing the ReML likelihood function 
and also to identify non-stochastic linear combinations that represent quantities that must be 
constrained to zero in the event that the error matrix describing the data is singular. The method 
is applied to the problem of determining how close four tracks from a Higgs — > 4 lepton decay 
approach a common position in space. This method can be used to discriminate between signal 
and background events at experiments at the LHC It has an advantage of being independent of 
pile up which is not the case for b-tagging. 
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11 Appendix: Pseudocode for Differentiation 

of Cholesky Decomposition 

This S ection includes corrections (shown in red) to the pseudocode which was given in ((Smith 



2001al ). Construction of the Cholesky decomposition for an indefinite matrix K^xat is summarized 

in the following table 
Initializations: 

L is lower triangular and Ly <— ij-th element of Kjvxiv^i) #2, ■ »,Xp)- 

Qk = "+" if k-th diagonal is part of non-negative submatrix, "— " otherwise. 



Algorithm: 

For k—1, ...,N do 

if \Lkk\ ~ zero, check to see if remaining Ljk « zero, j = k + 1, N and skip k-th pivot. 

Otherwise, Lkk ^— sqrt[QkLkk] and do: 

L jk <- QkLjk/Lkk, for j = k + 1, N, 

Lij Lij ± k {L jk x L ik ), for j = k + 1, N & i = j, N. 

end k 



Table 1. Pseudocode for Cholesky Decomposition with Possible Negative Diagonals. 
First derivatives, dF(L)/dx v , are summarized in the following table 
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Initializations: 

L is provided (see Table 1). 

Fnxn is a work space with elements F^, defined respectively for i> j. 

Qk = "+" if k-th diagonal is part of non-negative submatrix, "— " otherwise. 

=tfe = _ Ofc- 

<- dF(L)/dLij 



Algorithm: 

(a) F i— T(F) by following operations: 

For k = N, 1 (N.B. Decreasing Order) do 

if \Lkk\ > zero, then do: 

F ik <r- F ik ± k (F^ x L jk ), F jk <r- F jk ± k {F^ x L ik ), for j = k + 1, N Sz i = j, N 
F jk <- e k F jk /L kk , F kk <- F kk ± k (F jk x L jk ), for j = k + 1, N 



F kk 4— Q k (l/2)F kk /L kk 



end k 



(b) dF(L)/dx. 



V 



Ei>j F ij x dKij/dx v , v = 1, 2, ...p. 



Table 2. Pseudocode for Backward Differentiation of F(L). 



Second derivatives, d 2 



F(L) /dx v dx u , are summarized in the following table 
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Initializations: 

L is provided (see Table 1). 

Qk = "+" if fc-th diagonal is part of non-negative submatrix, "— " otherwise. 
±fe = — Ok- 

F <- T{dF{ 1 L)/dL ij ) provided (see Table 2). 

Satxtv and Qatxtv are work spaces with elements Sij, Qij, defined respectively for i > j. 
Qij <- dK i:j /dx v , i>j, v e {1,2, ...,p}. 

Algorithm: 

(a) For k = 1, N do forward sweep 
if \Lkk\ > zero, then do: 

Qkk *— Ofc(l/2) x Qkk/Lkk, 
Skk 4— ±fc2 x QkkFkk, 

Qjk 4— [QkQjk — QkkLjk]/ Lkk, 
Sjk 4— ±kQkk X Fjk, 

Skk <- S k k ±k (Qjk x Fj k ), for j = k + 1, N 
Qij 4— Qij ±fc (Qik x Ljk) ifc \LikQjk)i 

S ik 4- S ik ± k (Fij x Q jfc ), <- S 1 ^ ± fe (F y x Q ik ), for j = k + 1, N & z = j, JV 

(b) for i > j, % <- % + Em>« x d 2 F (L) j dL mn dL iy 

(c) reverse sweep, S T(S) (see Table 2). 

(d) d 2 F(L)/dx v dx u = J2i>j S ij x dKij/dx u + £\>,- Fy x d 2 K ij /dx v dx u , u = 1, 2, ...,p. 



Table 3. Pseudocode for Backward Differentiation Applied Twice to F(L). 
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